Compare the results in OSU and computed here for the three yardsticks. How and why are they different?
In O’Sullivan and Unwin (2010) (OSU) the fractal dimension (D) comes out to 1.44 for their measurements, while the rspatial package data example gives a D of 1.379. The difference is due to variation between the two datasets. For example, when the yardstick is 10 km, OSU measure a coastline of 180 km with 18 sticks, while the rspatial example measures 390 km with 39 sticks. This reflects differences in the detail and/or scale of the underlying data. Since the rspatial package was compiled between 2016 and 2019, its data may be newer than those in the OSU book, which was published in 2010; it is also likely that the two maps are simply at different scales.
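The yardstick estimate of D comes from the slope of a Richardson (log-log) plot of measured length against stick size. A minimal sketch of that calculation, using made-up yardstick counts rather than either dataset's actual measurements:

```r
# Richardson plot sketch: L(s) ~ s^(1 - D), so the log-log slope is 1 - D
# the stick sizes and counts below are illustrative, not from OSU or rspatial
stick <- c(10, 20, 40)       # yardstick length (km)
n     <- c(40, 17, 7)        # number of sticks needed at each length
len   <- stick * n           # total measured coastline length
fit <- lm(log(len) ~ log(stick))
D <- 1 - unname(coef(fit)[2])
D
```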
Create “matrices” from the book.
# independent variable
ind <- c(87, 95, 72, 37, 44, 24,
40, 55, 55, 38, 88, 34,
41, 30, 26, 35, 38, 24,
14, 56, 37, 34, 8, 18,
49, 44, 51, 67, 17, 37,
55, 25, 33, 32, 59, 54)
# dependent variable
dep <- c(72, 75, 85, 29, 58, 30,
50, 60, 49, 46, 84, 23,
21, 46, 22, 42, 45, 14,
19, 36, 48, 23, 8, 29,
38, 47, 52, 52, 22, 48,
58, 40, 46, 38, 35, 55)
Actually make them matrices using matrix.
#create a matrix from the ind and dep data
#byrow=TRUE because R fills matrices column-wise by default
mi <- matrix(ind, ncol=6, nrow=6, byrow=TRUE)
md <- matrix(dep, ncol=6, nrow=6, byrow=TRUE)
Create these matrices from ind and dep without using byrow=TRUE. Hint: use the t function after you made the matrix.
Independent Variable Matrix
#create the columnwise matrix
mi.columnwise <- matrix(ind, ncol = 6, nrow = 6)
#compare the two matrices for the ind variable
mi
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 87 95 72 37 44 24
## [2,] 40 55 55 38 88 34
## [3,] 41 30 26 35 38 24
## [4,] 14 56 37 34 8 18
## [5,] 49 44 51 67 17 37
## [6,] 55 25 33 32 59 54
mi.columnwise
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 87 40 41 14 49 55
## [2,] 95 55 30 56 44 25
## [3,] 72 55 26 37 51 33
## [4,] 37 38 35 34 67 32
## [5,] 44 88 38 8 17 59
## [6,] 24 34 24 18 37 54
Transpose the columnwise matrix
#transpose it using t
mi.tpose <- t(mi.columnwise)
#compare to check, are they the same?
mi
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 87 95 72 37 44 24
## [2,] 40 55 55 38 88 34
## [3,] 41 30 26 35 38 24
## [4,] 14 56 37 34 8 18
## [5,] 49 44 51 67 17 37
## [6,] 55 25 33 32 59 54
mi.tpose
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 87 95 72 37 44 24
## [2,] 40 55 55 38 88 34
## [3,] 41 30 26 35 38 24
## [4,] 14 56 37 34 8 18
## [5,] 49 44 51 67 17 37
## [6,] 55 25 33 32 59 54
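Rather than eyeballing the printouts, the comparison can be checked programmatically:

```r
# confirm element-by-element equality of the two matrices
identical(mi, mi.tpose)
all.equal(mi, mi.tpose)
```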
Dependent Variable Matrix
#create the columnwise matrix
md.columnwise <- matrix(dep, ncol = 6, nrow = 6)
#compare to the byrow=TRUE matrix
md
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 72 75 85 29 58 30
## [2,] 50 60 49 46 84 23
## [3,] 21 46 22 42 45 14
## [4,] 19 36 48 23 8 29
## [5,] 38 47 52 52 22 48
## [6,] 58 40 46 38 35 55
md.columnwise
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 72 50 21 19 38 58
## [2,] 75 60 46 36 47 40
## [3,] 85 49 22 48 52 46
## [4,] 29 46 42 23 52 38
## [5,] 58 84 45 8 22 35
## [6,] 30 23 14 29 48 55
Transpose the columnwise matrix
#transpose it using t
md.tpose <- t(md.columnwise)
#compare to check, are they the same?
md
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 72 75 85 29 58 30
## [2,] 50 60 49 46 84 23
## [3,] 21 46 22 42 45 14
## [4,] 19 36 48 23 8 29
## [5,] 38 47 52 52 22 48
## [6,] 58 40 46 38 35 55
md.tpose
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 72 75 85 29 58 30
## [2,] 50 60 49 46 84 23
## [3,] 21 46 22 42 45 14
## [4,] 19 36 48 23 8 29
## [5,] 38 47 52 52 22 48
## [6,] 58 40 46 38 35 55
Instead of the mean function, what other functions could, in principle, reasonably be used in an aggregation of raster cells?
Load the raster package
#we need the raster package for these
library(raster)
## Loading required package: sp
Turn the matrices into rasters
#uses the raster function
ri <- raster(mi)
rd <- raster(md)
Aggregation
#apply the mean function over blocks of two raster cells
ai1 <- aggregate(ri, c(2, 1), fun=mean)
ad1 <- aggregate(rd, c(2, 1), fun=mean)
Plot the Independent Raster
#change the view window to show the aggregate and the non agg
par(mfrow = c(1,2))
#plot ri
plot(ri, legend = FALSE)
text(ri)
#plot ai1
plot(ai1)
text(ai1)
#change view window back
par(mfrow = c(1,1))
We can use other functions within aggregation so long as the output is a number. For example, min, max, median, etc.
Here’s an example:
#aggregate by sum
ai.sum <- aggregate(ri, c(2,1), fun = sum)
Plot the original and the agg
#view window
par(mfrow = c(1,2))
#plot ri
plot(ri, legend = FALSE)
text(ri)
#plot ai.sum
plot(ai.sum)
text(ai.sum)
par(mfrow = c(1,1))
There are other ways to do the above (converting two RasterLayer objects to a data.frame). Show how to obtain the same result (d1) using as.vector and cbind.
Make a raster stack
#create a stacked raster
s1 <- stack(ai1, ad1)
#name the objects in the raster stack
names(s1) <- c('ind', 'dep')
#check it out
s1
## class : RasterStack
## dimensions : 6, 3, 18, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.3333333, 0.1666667 (x, y)
## extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
## coord. ref. : NA
## names : ind, dep
## min values : 13.0, 18.5
## max values : 91.0, 73.5
#plot
par(mfrow = c(1,1))
plot(s1)
Create a dataframe
#the robert way
d1 <- as.data.frame(s1)
#compare the two
head(s1)
## ind dep
## [1,] 91.0 73.5
## [2,] 54.5 57.0
## [3,] 34.0 44.0
## [4,] 47.5 55.0
## [5,] 46.5 47.5
## [6,] 61.0 53.5
head(d1)
## ind dep
## 1 91.0 73.5
## 2 54.5 57.0
## 3 34.0 44.0
## 4 47.5 55.0
## 5 46.5 47.5
## 6 61.0 53.5
Create a dataframe part 2
#using as.vector to create two vectors and bind them with cbind
d1.bind <- cbind(as.vector(ai1), as.vector(ad1))
#look the same?
head(d1)
## ind dep
## 1 91.0 73.5
## 2 54.5 57.0
## 3 34.0 44.0
## 4 47.5 55.0
## 5 46.5 47.5
## 6 61.0 53.5
head(d1.bind)
## [,1] [,2]
## [1,] 91.0 73.5
## [2,] 54.5 57.0
## [3,] 34.0 44.0
## [4,] 47.5 55.0
## [5,] 46.5 47.5
## [6,] 61.0 53.5
#note that this creates a matrix, not a dataframe; as.data.frame could be used
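To fully match d1, the matrix from cbind can be converted and its columns named (a small follow-up sketch):

```r
# convert the cbind result to a data.frame and name the columns to match d1
d1.bind.df <- as.data.frame(d1.bind)
colnames(d1.bind.df) <- c('ind', 'dep')
head(d1.bind.df)
```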
Show R code to make a cluster dendrogram summarizing the distances between these six sites, and plot it. See ?hclust.
Create X,Y pts matrix
A <- c(40, 43)
B <- c(1, 101)
C <- c(54, 111)
D <- c(104, 65)
E <- c(60, 22)
F <- c(20, 2)
#row bind
pts <- rbind(A,B,C,D,E,F)
#look good?
head(pts)
## [,1] [,2]
## A 40 43
## B 1 101
## C 54 111
## D 104 65
## E 60 22
## F 20 2
Get distances
#dist gets distance matrix
dis <- dist(pts)
dis
## A B C D E
## B 69.89278
## C 69.42622 53.93515
## D 67.67570 109.11004 67.94115
## E 29.00000 98.60020 89.20202 61.52235
## F 45.61798 100.80675 114.17968 105.00000 44.72136
#convert to matrix
D <- as.matrix(dis)
round(D)
## A B C D E F
## A 0 70 69 68 29 46
## B 70 0 54 109 99 101
## C 69 54 0 68 89 114
## D 68 109 68 0 62 105
## E 29 99 89 62 0 45
## F 46 101 114 105 45 0
Create cluster dendrogram
#use hclust; the choice of method is not critical in this instance
#members = NULL means each observation starts as its own singleton
dis.dendo <- hclust(dis, method = "ward.D2", members = NULL)
plot(dis.dendo)
Show how you can do ‘column-normalization’ (Just an exercise, in spatial data analysis this is not a typical thing to do).
First, row normalization: get a weights matrix
W <- 1 / D
round(W, 4)
## A B C D E F
## A Inf 0.0143 0.0144 0.0148 0.0345 0.0219
## B 0.0143 Inf 0.0185 0.0092 0.0101 0.0099
## C 0.0144 0.0185 Inf 0.0147 0.0112 0.0088
## D 0.0148 0.0092 0.0147 Inf 0.0163 0.0095
## E 0.0345 0.0101 0.0112 0.0163 Inf 0.0224
## F 0.0219 0.0099 0.0088 0.0095 0.0224 Inf
Row normalization:
#replace inf with NA
#inf comes from dividing by 0
W[!is.finite(W)] <- NA
Compute the row sums.
#rowSums
rtot <- rowSums(W, na.rm=TRUE)
# this is equivalent to
# rtot <- apply(W, 1, sum, na.rm=TRUE)
rtot
## A B C D E F
## 0.09989170 0.06207541 0.06763182 0.06443810 0.09445017 0.07248377
Do they add up to 1?
#divide by row totals, should be 1
W <- W / rtot
rowSums(W, na.rm=TRUE)
## A B C D E F
## 1 1 1 1 1 1
W is now row-normalized.
Let’s try columnwise:
#get a new weights matrix
cW <- 1 / D
#show round to 4 decimals
round(cW, 4)
## A B C D E F
## A Inf 0.0143 0.0144 0.0148 0.0345 0.0219
## B 0.0143 Inf 0.0185 0.0092 0.0101 0.0099
## C 0.0144 0.0185 Inf 0.0147 0.0112 0.0088
## D 0.0148 0.0092 0.0147 Inf 0.0163 0.0095
## E 0.0345 0.0101 0.0112 0.0163 Inf 0.0224
## F 0.0219 0.0099 0.0088 0.0095 0.0224 Inf
Replace the inf with NA
#replace inf
cW[!is.finite(cW)] <- NA
Sum the columns (I’m using apply here but colSums would also work)
#get column sums
ctot <- apply(cW, 2, sum, na.rm=TRUE)
ctot
## A B C D E F
## 0.09989170 0.06207541 0.06763182 0.06443810 0.09445017 0.07248377
We can’t simply divide cW by ctot because R recycles vectors down the columns (column-major order), which would divide each row by the totals rather than each column, so we force columnwise division using sweep.
#use sweep to do division columnwise
cW <- sweep(cW, MARGIN = 2, FUN= "/", STATS = ctot)
#do they total 1?
colSums(cW, na.rm=TRUE)
## A B C D E F
## 1 1 1 1 1 1
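An equivalent trick that avoids sweep is a double transpose (assuming D and ctot from the chunks above): dividing the transpose of the matrix by ctot recycles the totals along what were the columns.

```r
# alternative to sweep: transpose, divide, transpose back
cW2 <- 1 / D
cW2[!is.finite(cW2)] <- NA
# ctot recycles down the columns of t(cW2), whose row index
# corresponds to the original column index; then transpose back
cW2 <- t(t(cW2) / ctot)
colSums(cW2, na.rm = TRUE)
```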
The SpatialPolygonsDataFrame class is defined in package sp. But we never used library('sp'). So how is it possible that we have such an object?
The voronoi function comes from the package dismo. The dismo package imports sp in its NAMESPACE (import("raster", "sp", "methods")), which allows voronoi to produce SpatialPolygonsDataFrame objects without the sp package being attached explicitly.
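One way to verify this from R (a quick sketch, assuming dismo is installed) is to inspect the package's namespace imports:

```r
# list the namespaces dismo imports; sp should be among them
imp <- getNamespaceImports("dismo")
"sp" %in% names(imp)
```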
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
#load the ISLR library
library(ISLR)
#load Auto into environment object
auto <- Auto
Which of the predictors are quantitative, and which are qualitative?
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
mpg, displacement, horsepower, weight, and acceleration are all quantitative; cylinders and year could be quantitative or qualitative depending on use; and name and origin are qualitative, in that countries and names are not numbers.
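A quick programmatic check (a sketch; the origin labels follow the ISLR documentation, and treating cylinders or year as categorical remains a judgment call):

```r
# which columns does R store as numeric?
sapply(auto, is.numeric)
# origin is coded 1/2/3 but is really categorical; it could be recoded
# (labels per the ISLR documentation: 1 = American, 2 = European, 3 = Japanese)
origin.f <- factor(auto$origin, labels = c("American", "European", "Japanese"))
table(origin.f)
```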
What is the range of each quantitative predictor? You can answer this using the range() function.
#find the range use apply and the range function, subset only the quant vars
apply(auto[,1:7], MARGIN = 2, FUN = range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
What is the mean and standard deviation of each quantitative predictor?
#more apply using mean and sd
apply(auto[,1:7], MARGIN = 2, FUN = mean)
## mpg cylinders displacement horsepower weight
## 23.445918 5.471939 194.411990 104.469388 2977.584184
## acceleration year
## 15.541327 75.979592
apply(auto[,1:7], MARGIN = 2, FUN = sd)
## mpg cylinders displacement horsepower weight
## 7.805007 1.705783 104.644004 38.491160 849.402560
## acceleration year
## 2.758864 3.683737
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
#create matrices to make a table
r.sub <- apply(auto[-10:-85,1:7], MARGIN = 2, FUN = range)
m.sub <- apply(auto[-10:-85,1:7], MARGIN = 2, FUN = mean)
sd.sub <- apply(auto[-10:-85,1:7], MARGIN = 2, FUN = sd)
#bind together
table <- rbind(r.sub, m.sub, sd.sub)
#make dataframe
table.df <- as.data.frame(table)
#rename the rows
rownames(table.df) <- c("range x", "range y", "mean", "st.dev")
#show it pretty
round(table.df, 2)
## mpg cylinders displacement horsepower weight acceleration year
## range x 11.00 3.00 68.00 46.00 1649.00 8.50 70.00
## range y 46.60 8.00 455.00 230.00 4997.00 24.80 82.00
## mean 24.40 5.37 187.24 100.72 2935.97 15.73 77.15
## st.dev 7.87 1.65 99.68 35.71 811.30 2.69 3.11
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
#move the window so we can see the title and subtitle
par(mar = c(5, 4, 1, 2))
#pairswise matrix for variables to see colinearity
pairs(mpg ~ year + cylinders + displacement + horsepower + weight + acceleration, auto, na.action = na.omit)
title(main = "Pairs Scatterplots", sub = "for Auto dataset")
#reset the window
par(mar = c(5, 4, 4, 2) + 0.1)
#load ggplot to make pretty plots
library(ggplot2)
library(ggthemes)
#mpg ~ year
ggplot(auto) +
geom_jitter(aes(factor(year), mpg)) +
labs(title = "MPG by Year (1970-1982)",
subtitle = "ISLR Auto Dataset",
caption = "Tyler Jackson") +
theme_economist()
It looks like there is a positive relationship between mpg and year.
#acceleration ~ mpg
ggplot(auto) +
geom_jitter(aes(mpg, acceleration)) +
labs(title = "MPG by Acceleration (1970-1982)",
subtitle = "ISLR Auto Dataset",
caption = "Tyler Jackson") +
theme_economist()
It looks like mpg and acceleration time have a positive relationship.
#acceleration ~ mpg
ggplot(auto) +
geom_jitter(aes(weight, acceleration)) +
labs(title = "Weight by Acceleration (1970-1982)",
subtitle = "ISLR Dataset",
caption = "Tyler Jackson") +
theme_economist()
It looks like weight and acceleration are negatively correlated.
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Looking at the pairs scatterplot, it would appear that many factors, taken independently, could be useful in predicting mpg. For instance, displacement, cylinders, horsepower, and weight all have a negative relationship with mpg, while year and acceleration have a positive correlation. However, many of the variables are collinear: cylinders, displacement, horsepower, and weight all have a positive relationship with each other and a negative relationship with mpg, so it would be hard to say that one of them has more predictive power than another. Year looks better in the scatterplots and could be a good predictor, as could acceleration.
Looking at a correlation table:
round(cor(auto[,1:7], use = "complete.obs"), 2)
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69
## weight -0.83 0.90 0.93 0.86 1.00 -0.42
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29
## year
## mpg 0.58
## cylinders -0.35
## displacement -0.37
## horsepower -0.42
## weight -0.31
## acceleration 0.29
## year 1.00
Here we can see that engine size and weight really have a high impact on mpg, with correlation coefficients ranging from -0.78 to -0.83 (for cylinders, displacement, horsepower, and weight). So while mpg looks like it has improved over time, the engine really determines the mpg. Year also has a negative correlation with engine size, which could explain the increase in mpg over time.
This question involves the use of simple linear regression on the Auto data set.
Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
#build a model with lm
fit.horse <- lm(mpg ~ horsepower, auto)
#get the results
summary(fit.horse)
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
For example: Is there a relationship between the predictor and the response?
Looking at the regression results, the p-values are nearly zero, indicating statistical significance. This means there is a relationship between the predictor and the response.
How strong is the relationship between the predictor and the response?
The coefficient for horsepower is about -0.16, meaning that each additional unit of horsepower decreases predicted mpg by about 0.16. The R^2 indicates that about 60% of the variation in the response can be explained by the model, which suggests relatively strong explanatory power.
Is the relationship between the predictor and the response positive or negative?
Negative. As horsepower increases, mpg should decrease.
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
95% Confidence interval:
#we can use predict() to estimate Y at a given X
#the interval argument allows a choice of the type of interval
predict(fit.horse,
data.frame(horsepower = c(98)),
interval ="confidence",
level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The predicted mpg associated with hp of 98 is 24.46. The 95% confidence interval is 23.97 to 24.96.
95% Prediction Interval:
#note the interval argument here is 'prediction'
predict(fit.horse,
data.frame(horsepower = c(98)),
interval ="prediction",
level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The predicted mpg associated with hp of 98 is 24.46. The 95% prediction interval is 14.81 to 34.12.
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
We can use ggplot to do this, using geom_smooth to draw the regression line:
#method = lm on geom smooth for a lm line
ggplot(auto) +
geom_point(aes(horsepower, mpg)) +
geom_smooth(aes(horsepower, mpg), method = "lm", se = FALSE) +
labs(title = "MPG and Horsepower", subtitle = "with least squares regression line", caption = "ISLR Auto data") +
theme_economist()
We can also do this old school with plot() and abline():
#plot the data
plot(mpg ~ horsepower, auto)
#the lm line
abline(fit.horse, lwd = 2, col = "red")
title(main = "MPG and Horsepower with LM line", sub = "for Auto dataset")
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
#set up the viewing window 2x2
par(mfrow = c(2,2))
plot(fit.horse)
par(mfrow = c(1,1))
The first plot, residuals vs. fitted, indicates a non-linear relationship that is not explained by the model. The second plot (Q-Q) indicates a roughly normal distribution of the residuals, so not likely a problem there. The third plot (Scale-Location) shows some degree of homoscedasticity, though many values cluster on the high end, which could indicate unequal variance. Finally, in the last plot (Residuals vs. Leverage) we see no influential cases, and the Cook's distance lines are not visible, indicating no influential outliers.
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
#change margins so the title doesn't overlap
par(mar = c(5, 4, 1, 2))
#using pairs
pairs(auto)
title(main = "Pairs Scatterplots", sub = "for Auto dataset")
#reset the margins to default
par(mar = c(5, 4, 4, 2) + 0.1)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
#compute table all colums except the last
cor.table <- cor(auto[,1:8])
#use round to show to 2 decimal places
round(cor.table, 2)
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69
## weight -0.83 0.90 0.93 0.86 1.00 -0.42
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29
## origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21
## year origin
## mpg 0.58 0.57
## cylinders -0.35 -0.57
## displacement -0.37 -0.61
## horsepower -0.42 -0.46
## weight -0.31 -0.59
## acceleration 0.29 0.21
## year 1.00 0.18
## origin 0.18 1.00
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
#use . to indicate all variables and -name to exclude name
lm.fit <- lm(mpg ~ . -name, auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance:
Is there a relationship between the predictors and the response?
It seems that weight, year, and origin all have highly significant coefficients; displacement is also significant, but less so. The R-squared indicates that ~82% of the variation in the response can be explained by the predictors in the model.
Which predictors appear to have a statistically significant relationship to the response?
Weight, year, and origin (at the 0.001 level), and displacement (at the 0.01 level).
What does the coefficient for the year variable suggest?
The coefficient for the year variable is 0.75, which means that each additional model year is associated with a 0.75 increase in mpg, holding the other predictors constant.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.
#set up the plot space
par(mfrow=c(2,2))
plot(lm.fit)
#reset the plot space
par(mfrow = c(1,1))
The first plot, residuals vs. fitted, indicates a non-linear relationship not captured by the model, given the non-random distribution of points. The second plot (Q-Q) indicates a roughly normal distribution of the residuals, with a skew at the top end. The third plot (Scale-Location) shows some degree of homoscedasticity, though many values cluster on the high end, which could indicate unequal variance. Finally, in the last plot (Residuals vs. Leverage) we see no influential cases, but one point (14) could have high leverage.
Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
All of the values in the Residuals vs. Leverage plot are within the Cook's distance lines, but there is one isolated point (14) that could have high leverage. Some standardized residuals are higher than 4 or lower than -2, indicating outliers.
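Those visual impressions can be checked numerically (a sketch, assuming lm.fit from above); studentized residuals flag candidate outliers and hat values flag high-leverage points:

```r
# count large studentized residuals (candidate outliers)
rs <- rstudent(lm.fit)
sum(abs(rs) > 3)
# observations with leverage more than twice the average hat value
which(hatvalues(lm.fit) > 2 * mean(hatvalues(lm.fit)))
```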
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
New model using interactions of all terms with a correlation coefficient of 0.90 or higher:
#add interaction terms using *
#same as a+b+a:b
fit.interactions <- lm(mpg ~ cylinders*displacement + displacement*weight + horsepower*displacement, auto)
summary(fit.interactions)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight + horsepower * displacement, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1308 -2.1597 -0.3652 1.9001 16.9864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.584e+01 2.569e+00 21.733 < 2e-16 ***
## cylinders 3.330e-01 8.190e-01 0.407 0.6845
## displacement -9.524e-02 1.605e-02 -5.935 6.59e-09 ***
## weight -3.803e-03 1.589e-03 -2.394 0.0172 *
## horsepower -1.844e-01 2.855e-02 -6.460 3.18e-10 ***
## cylinders:displacement 1.569e-03 3.581e-03 0.438 0.6615
## displacement:weight 4.258e-06 5.555e-06 0.766 0.4439
## displacement:horsepower 4.238e-04 9.786e-05 4.331 1.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.865 on 384 degrees of freedom
## Multiple R-squared: 0.7591, Adjusted R-squared: 0.7547
## F-statistic: 172.9 on 7 and 384 DF, p-value: < 2.2e-16
In this model, the interaction between displacement and horsepower is statistically significant. Other interactions included were not.
Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
#add terms that are transformed
fit.trans <- lm(mpg ~ .-name + I(horsepower^2) +
log(horsepower) + sqrt(horsepower),
auto)
summary(fit.trans)
##
## Call:
## lm(formula = mpg ~ . - name + I(horsepower^2) + log(horsepower) +
## sqrt(horsepower), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4353 -1.7487 -0.0456 1.3931 11.7105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.182e+02 1.706e+02 -1.865 0.063006 .
## cylinders -1.213e-01 3.285e-01 -0.369 0.712020
## displacement -3.336e-03 7.316e-03 -0.456 0.648684
## horsepower 6.901e+00 2.728e+00 2.529 0.011826 *
## weight -3.595e-03 6.715e-04 -5.354 1.49e-07 ***
## acceleration -2.722e-01 9.953e-02 -2.735 0.006528 **
## year 7.377e-01 4.522e-02 16.312 < 2e-16 ***
## origin 8.611e-01 2.528e-01 3.407 0.000728 ***
## I(horsepower^2) -4.864e-03 1.974e-03 -2.464 0.014168 *
## log(horsepower) 3.312e+02 1.473e+02 2.248 0.025123 *
## sqrt(horsepower) -1.868e+02 7.617e+01 -2.452 0.014664 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.944 on 381 degrees of freedom
## Multiple R-squared: 0.8614, Adjusted R-squared: 0.8578
## F-statistic: 236.8 on 10 and 381 DF, p-value: < 2.2e-16
Compared to the first fit, this model improves with a higher R^2, but the transformed terms echo the significance of horsepower untransformed.
par(mfrow = c(2,2))
plot(fit.trans)
par(mfrow = c(1,1))
Looking at the residuals, it would seem that the fitted values and the response have a nearly linear relationship. The Q-Q plot shows some issues with the distribution: the residuals skew at the low and high ends. The leverage plot indicates several values that could be outliers, but no influential cases.
Overall it’s hard to interpret the transformed variables without some theoretical basis for transforming them.
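Another option worth trying is transforming the response rather than the predictors; a brief sketch (assuming auto as loaded above):

```r
# model log(mpg) with the same predictors and compare R^2
fit.logmpg <- lm(log(mpg) ~ . - name, auto)
summary(fit.logmpg)$r.squared
```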